0. Load all needed packages:

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!requireNamespace("GEOmetadb", quietly = TRUE)){
  BiocManager::install("GEOmetadb")
}
if (!require("knitr", quietly = TRUE))
  install.packages("knitr")

if (! requireNamespace("edgeR", quietly = TRUE)) {
    BiocManager::install("edgeR")
}

if (!requireNamespace("ComplexHeatmap", quietly = TRUE)){
  BiocManager::install("ComplexHeatmap")
}
if (!require("kableExtra", quietly = TRUE))
  install.packages("kableExtra")
if (! requireNamespace("RCurl", quietly = TRUE)) {
  BiocManager::install("RCurl")
}
if (!require("GSA", quietly = TRUE))
  install.packages("GSA")
if (!require("VennDiagram")) {
  install.packages("VennDiagram")
}

library(GEOmetadb)
library(knitr)
library(edgeR)
library(ComplexHeatmap) 
library(circlize)
library(kableExtra)
library(RCurl)
library(GSA)
library(VennDiagram)

1. Brief Introduction of Assignment 1 Results

1.1 Loading the chosen dataset

  • The choice of dataset is GSE85995, the data file was download by using GEOmetadb package method.

  • In assignment 1, clean process was performed to remove duplicated, non-informative and weakly express data. Mapping process was performed by mappping row identifiers into the most updated HGNC symbols. The cleaned and mapped data was saved into gata3_cleaned_mapped_data.rds file. More detail code algorithm was presented in A1.Rmd file.

sfiles = getGEOSuppFiles('GSE85995')

# <clean and map, the detailed code are in "A1.Rmd"> ...

if (!file.exists("gata3_cleaned_mapped_data.rds")) {
  options(knitr.duplicate.label = 'allow')
  source(purl("A1.Rmd", output = tempfile()))
}
gata3_cleaned_maped_data <- readRDS("gata3_cleaned_mapped_data.rds")


1.2 Normalization and clean

  • TMM normalization was performed in assignment 1.
filtered_data_matrix <- as.matrix(gata3_cleaned_maped_data[,3:ncol(gata3_cleaned_maped_data)])
rownames(filtered_data_matrix) <- gata3_cleaned_maped_data$hgnc_symbols
group = c(rep("scrambled siRNA",3),rep("GATA3 siRNA",3))
d = DGEList(counts=filtered_data_matrix, group = group)
d = calcNormFactors(d)
normalized_cleaned_gata3 <- cpm(d)
  • Density plots were compared for the cleaned and mapped data before and after normalization. The data before normalization followed a normal distribution. After normalization, the pattern remained and the edges of the data were cleaned up slightly.
# before
data2plot <- log2(cpm(gata3_cleaned_maped_data[,3:ncol(gata3_cleaned_maped_data)]))
counts_density <- apply(data2plot, 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
  xlim <- range(c(xlim, counts_density[[i]]$x)); 
  ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n", 
     ylab="Smoothing density of log2-CPM", 
     main = "Figure 1.1: Pre-normalization Density Plot",
     cex.lab = 0.85)
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
legend("topright", colnames(data2plot),  
       col=cols, lty=ltys, cex=0.75, 
       border ="blue",  text.col = "green4", 
       merge = TRUE, bg = "gray90")

# after
normalized_data2plot <- log2(normalized_cleaned_gata3)
normalized_counts_density <- apply(normalized_data2plot, 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(normalized_counts_density)) {
  xlim <- range(c(xlim, normalized_counts_density[[i]]$x)); 
  ylim <- range(c(ylim, normalized_counts_density[[i]]$y))
}
cols <- rainbow(length(normalized_counts_density))
ltys <- rep(1, length(normalized_counts_density))
plot(normalized_counts_density[[1]], xlim=xlim, ylim=ylim, type="n",      
     ylab="Smoothing density of log2-CPM", 
     main = "Figure 1.2: Post-normalization Density Plot",
     cex.lab = 0.85)
for (i in 1:length(normalized_counts_density)) lines(normalized_counts_density[[i]], col=cols[i], lty=ltys[i])
legend("topright", colnames(normalized_data2plot), 
       col=cols, lty=ltys, cex=0.75, 
       border ="blue",  text.col = "green4", 
       merge = TRUE, bg = "gray90")
Figure 1: GATA3 RNASeq Samples Density Plots Comparison by using TMM Normalization.Figure 1: GATA3 RNASeq Samples Density Plots Comparison by using TMM Normalization.

Figure 1: GATA3 RNASeq Samples Density Plots Comparison by using TMM Normalization.

  • The normalized data was saved into gata3_cleaned_normalized.rds file.
if (!file.exists("gata3_cleaned_normalized.rds")) {
  saveRDS(normalized_cleaned_gata3, "gata3_cleaned_normalized.rds")
}

2. Brief Introduction of Assignment 2 Results

2.1 Differential Gene Expression:

  • The normalized data was initially visualized by using a heatmap and MDS plot to select the proper model for the differential gene expression task. C represents the control condition and G represents the testing condition. The initial heat map has roughly clustering by cell type, but the pattern of genes division is not very apparent. In the MDS plots, the cell type model demonstrated a closer distance within the same group than the patient type model. Then, the construction of simple and complex models was designed based on cell type.
    Figure 2: Initial heat map for the normalized GATA3 data

    Figure 2: Initial heat map for the normalized GATA3 data

Figure 3: GATA3 RNASeq Samples MDS Plots Comparison by using Two ModelsFigure 3: GATA3 RNASeq Samples MDS Plots Comparison by using Two Models

Figure 3: GATA3 RNASeq Samples MDS Plots Comparison by using Two Models


  • Two models were designed: simple (cell type only) and complex (cell type and patient variation). The lmFit() method in the limma package was applied to perform the Benjamini-Hochberg FDR multiple hypothesis testing at the threshold of 0.05. The threshold value is also the same value used in the original paper.

  • According to the results in Figure 4, the gene of interest was selected in both models with close to 0 p-values. The majority of genes stood not significant. For the differential expressed gene (DEG) from both models, the p-values are fundamentally below the threshold of 0.05.


  • The complex model was used for the edgeR QLF test additionally. Similar to the result from limma, the edge results also have a large portion of non-significant genes as the grey region in Figure 5. But the amount of differentially expressed genes selected by the edgeR method was less than that of limma. GATA3 was also selected by the edgeR method with an expected low p-value.

  • To quantitatively summarize the results from both methods:

    • In the limma method, there were 6378 DEGs with the threshold of 0.05, which is 51.17% of genes in the cleaned and mapped data.

    • In the edgeR method, there were 1419 DEGs with the threshold of 0.05, which is 11.38% of genes in the cleaned and mapped data.

    • After the p-value correction by the Benjamini-Hochberg method, 5311 genes passed in the limma method, which has the 42.61% of the selected differentially expressed genes. From the chosen model in the edgeR method, 452 genes passed the correction. The percentage was 3.63%.


  • Eventually, due to the design target of the edgeR method algorithm being more conforming to the sample type, the edgeR method generated results were used to perform the later tasks.


  • The volcano plot in Figure 6 displays the division of up-regulated and down-regulated genes from the edgeR result. A clear separation for up-regulated and down-regulated genes occurred based on the logFC.

  • The top hits from the edgeR results were used to generate the heat map in Figure 7. The graph presents clustering based on cell type. The clustered patterning is more clear compared to the initial heatmap.

    Figure 7: Normalized GATA3 RNASeq Samples Heat Maps by using Complex Model in edgeR

    Figure 7: Normalized GATA3 RNASeq Samples Heat Maps by using Complex Model in edgeR


2.2 Thresholded over-representation analysis

  • The results of the DEGs list were saved as three txt files: up-regulated genes list, down-regulated genes list and all ranked genes list. The method to distinguish between up-regulated and down-regulated genes is based on FDR values.
qlf_output_hits_withgn <- merge(rownames(gata3_data),qlf_output_hits, by.x=1, by.y = 0)
colnames(qlf_output_hits_withgn)[1] <- "hgnc_symbol"
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * 
                                        sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
up_regulated <- qlf_output_hits_withgn$hgnc_symbol[which(qlf_output_hits_withgn$PValue < 0.05
                                                   & qlf_output_hits_withgn$logFC > 0)]
down_regulated <- qlf_output_hits_withgn$hgnc_symbol[which(qlf_output_hits_withgn$PValue < 0.05
                                                     & qlf_output_hits_withgn$logFC < 0)]
if(!file.exists("gata3_upregulated_genes.txt")){
  write.table(x=up_regulated,
            file=file.path("data","gata3_upregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
}

if(!file.exists("gata3_downregulated_genes.txt")){
  write.table(x=down_regulated,
            file=file.path("data","gata3_downregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
}
if(!file.exists("gata3_ranked_genelist.txt")){
  write.table(x=data.frame(genename= qlf_output_hits_withgn$hgnc_symbol,
                         F_stat=qlf_output_hits_withgn$rank),
            file=file.path("data","gata3_ranked_genelist.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
}


  • g:profiler (Raudvere et al. 2019) was used for performing the threshold analysis for the up-regulated and down-regulated genes lists. The annotation sources used were GO biological process, Reactome and WikiPathways. The threshold was 0.05 in Benjamini-Hochberg FDR. The retrieved results are:
Table 1.1: The up-regulated genes list result
Resource Top Term \(T\) \(Q\) \(T \cap Q\)
Go: BP regulation of chromosome organization 192 834 30
REAC Mitotic Prometaphase 199 566 35
WP Retinoblastoma gene in cancer 88 466 32


Table 1.2: The down-regulated genes list result
Resource Top Term \(T\) \(Q\) \(T \cap Q\)
Go: BP ribosomal large subunit biogenesis 72 527 18
REAC rRNA processing in the nucleus and cytosol 192 399 40
WP Metabolic reprogramming in colon cancer 44 341 16


Table 1.3: All DEG genes list result
Resource Top Term \(T\) \(Q\) \(T \cap Q\)
Go: BP regulation of chromosome organization 192 1361 41
REAC rRNA processing in the nucleus and cytosol 192 965 47
WP Retinoblastoma gene in cancer 88 787 36



3. Non-thresholded Gene set Enrichment Analysis

3.1 GSEA input files and parameters

  • The ranked list saved in the assignment 2 was used as rank list input in non-threshold gene sets enrichment analysis.

  • Firstly, I converted the ranked txt file into an rnk file as valid GSEA input.

txt_file <- read.delim("data/gata3_ranked_genelist.txt", header = FALSE)
write.table(x=txt_file, file=file.path("data","gata3_ranked_genelist.rnk"),sep = "\t", 
            row.names = FALSE,col.names = c('GeneName','rank'),quote = FALSE)
  • The gene database used was Human_GOBP_AllPathways_no_GO_iea_March_01_2022_symbol.gmt from the Bader lab (Merico et al. 2010), which is the latest version of human symbol in the resources.
# go to the current_release for finding the latest genesets for human
gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)",
              contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest <- file.path(paste(getwd(),'data', sep="/"),gmt_file)
download.file(paste(gmt_url,gmt_file,sep=""), destfile=dest)
  • The parameters used are:

    • Collapse = No_Collapse
    • max gene set size = 200
    • min gene set size = 15
  • The GSEA was performed externally by using the above files and configurations in the GSEA software.

  • The initial html of the GSEA results: index.html

3.2 Short Summary of the Results:

1. What method did you use? What genesets did you use? Make sure to specify versions and cite your methods.

  • I used GSEA Preranked method in GSEA 4.2.3 software.

  • The geneset database used was the latest version of the human symbol from the Bader lab gene sets collections (Human_GOBP_AllPathways_no_GO_iea_March_01_2022_symbol.gmt) (“Enrichment Map Gene Sets” 2016). The query used the getURL() method in the RCurl package to obtain the file (Temple Lang 2022). Applied the regex expression to find the latest updated file including GO biological pathway dataset and download it into the data folder for use.

  • The parameter for setting GSEA Preranked task was listed on above.

2. Summarize your enrichment results.

  • The reason for regulating the geneset size range from 200 to 15 is to obtain more specific meaningful pathways and avoid extreme values affecting the results. The choice of not collapse is due to the uniqueness of all gene symbols in the ranked list file. In the results generated by assignment 1, all hgnc symbols saved in the normalized data were verified to be unique. This setting is also the recommended parameter setting for a ranked gene list with unique human gene symbols in the user guide of GSEA (Subramanian et al. 2005).

  • The top genesets returned summarized table is:

    Table 2: The top gene sets in GSEA results

    Type

    Top geneset

    Size

    P-value

    ES

    NES

    FDR

    Top gene

    na_pos

    CYTOSKELETON-DEPENDENT CYTOKINESIS%GOBP%GO:0061640

    88

    0

    0.69

    2.03

    0.014

    STMN1

    na_neg

    VALIDATED TARGETS OF C-MYC TRANSCRIPTIONAL ACTIVATION%PATHWAY INTERACTION DATABASE NCI-NATURE CURATED DATA%VALIDATED TARGETS OF C-MYC TRANSCRIPTIONAL ACTIVATION

    75

    0

    -0.79

    -2.24

    0.000

    UBTF

3. How do these results compare to the results from the thresholded analysis in Assignment 2. Compare qualitatively. Is this a straight forward comparison? Why or why not?

  • Compared with the assignment 2 results, the top genesets returned in the enrichment analysis were not identical. For the up-regulated genes, the second-ranked geneset (MITOTIC PROMETAPHASE%REACTOME DATABASE ID RELEASE 79%68877) was the top geneset in the assignment 2 Reactome result. The top geneset generated from the GSEA result was the 12th of GO:BP results in assignment 2. Similar to the down-regulated genes, the third GSEA ranked geneset (rRNA PROCESSING IN THE NUCLEUS AND CYTOSOL%REACTOME%R-HSA-8868773.3) was the top hit in the assignment 2 Reactome result.

  • This comparison is straightforward. The ranked gene content imported in the enrichment analysis is the same as the normalized data but additionally assigned with ranking scores. The annotation source of GO: BP is also commonly applied in both analyses. Also in the parameter settings, I used the same range of 15 to 200 to limit the genome size in both threshold and non-threshold analyses. However, the top hit genesets are relevant but not identical.


4. Visualize your Gene set Enrichment Analysis in Cytoscape

4.1 Create an enrichment map

The enrichment map was created externally using Cytoscape software (Shannon et al. 2003) and EnrichmentMap Cytoscape App (Merico et al. 2010). The network visualization pictures were saved into the data folder.

How many nodes and how many edges in the resulting map? What thresholds were used to create this map? Make sure to record all thresholds. Include a screenshot of your network prior to manual layout.

  • Cytoscape 3.9.1 and EnrichmentMap Cytoscape App 3.3.3 were used to perform the GESA results visualization.

  • The parameter for generating the enrichment map is:

    • FDR q_value cutoff= 0.1
    • p-value for node cutoff = 0.05; for edge cutoff = 0.375
  • 186 gene sets (nodes) and 496 overlaps between gene sets (edges) present on the network.

  • The picture of current network: network.

4.2 Annotate your network

The annotation setup was created externally using AutoAnnotate App (Kucera et al. 2016) in Cytoscape software.

What parameters did you use to annotate the network. If you are using the default parameters make sure to list them as well.

  • The annotation used was default from AutoAnnotate app 1.3.5 in Cytoscape. The choice of cluster annotation model is MCL Cluster Annotation Set by default. The label column is by GS_DESCR (gene description). Create singleton clusters and layout network to prevent cluster overlap are selected.

  • The screenshot of annotated network: annotated_network


4.3 Make a publication ready figure

  • The publication figure will be:
Figure 8: Publication ready figure for enrichment results

Figure 8: Publication ready figure for enrichment results

4.4 Collapse your network to a theme network

What are the major themes present in this analysis? Do they fit with the model? Are there any novel pathways or themes?

  • The two largest clusters of the down-regulated genesets on the network are process purine biosynthetic and major rRNA nucleus. They contain 23 nodes and 14 nodes respectively. On the other side, protein centrosome mitotic was the largest up-regulated genesets cluster in Figure 8. It contains 10 nodes. These clusters would be considered a theme network.

  • Comparing to the g:profiler results, the major rRNA nucleus cluster contain the down-regulated geneset top hits from Reactome (rRNA processing in the nucleus and cytosol REAC:R-HSA-8868773) and GO:BP (ribosomal large subunit biogenesis GO:0042273). Also in the up-regulated genesets, the Reactome top hit (Mitotic prometaphase REAC:R-HSA-68877) was included in the protein centrosome mitotic cluster.

  • A typical novel pathway would be tubule development morphogenesis, which was not present in the g:profiler results.


5. Interpretation and detailed view of results


1. Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods

  • The original paper performed DEG analysis and cellular function analysis network for GATA3 knockdown trophoblasts. The conclusion is the regulation of GATA3 for trophoblast cells in migration and invasion aspects (Lee et al. 2016). The non-threshold methods results contain the tubule development morphogenesis cluster. The largest geneset in the cluster, tissue morphogenesis, includes GATA3 and MYC at the top leading genes, which conform to the conclusion of MYC being one of the down-regulated genes mentioned in the paper. Most of the pathways in this cluster also contain MYC as leading genes.

  • The top hits obtained from the two methods were different genesets, but the several top hits from the GSEA results contain the top hits from the threshold method results. New pathways have also emerged in network visualization, such as tubule development morphogenesis.

2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result?

  • The major rRNA nucleus cluster is one of the major themes in the network. The highest-ranked gene in this cluster is eIF4A. The evidence I found is the correlation of eIF4A for trophoblast cells activities. It was reported to involve the restricted global protein synthesis in trophoblast cells by interacting with AMOT (Enrichment of angiomotin). Over AMOT expression in the placenta was associated with intrauterine growth restriction in both rats and humans (Basak et al. 2020).

  • The evidence could provide the involvement of major rRNA nucleus clusters in the effect of trophoblast cells activities.


6. Dark Matter

6.1 Generate dark matter overlaps graph

  • We will sort out the DEGs selected from the project but not shown in any pathways as dark matter analysis. Firstly, choose the same database for genesets and load the files.
# use the same gene database
gmt_file <- file.path("data", "Human_GOBP_AllPathways_no_GO_iea_March_01_2022_symbol.gmt")
capture.output(genesets <- GSA.read.gmt(gmt_file), file = "gsa_loud.out")
names(genesets$genesets) <- genesets$geneset.names

# load expression file and rank scores
expression <- readRDS("gata3_cleaned_normalized.rds")
ranks <- ranks <- read.table(file.path(getwd(), "data", "gata3_ranked_genelist.txt"),
                    header = FALSE, sep = "\t", quote = "\"",
                    stringsAsFactors = FALSE)

# get gsea result files
gseaDirectories <- list.files(path = file.path(getwd(),"data/gsea_results"), 
                                 pattern = "\\.GseaPreranked")
if(length(gseaDirectories) == 1){
  gseaDir <- file.path(getwd(),"data/gsea_results")
  gseaResultsFiles <- list.files(path = gseaDir, 
                                 pattern = "gsea_report_*.*.tsv")
  enrFile1 <- read.table(file.path(gseaDir,gseaResultsFiles[1]), 
                        header = TRUE, sep = "\t", quote="\"",  
                        stringsAsFactors = FALSE,row.names=1)
  enrFile2 <- read.table(file.path(gseaDir,gseaResultsFiles[1]), 
                        header = TRUE, sep = "\t", quote="\"",  
                        stringsAsFactors = FALSE,row.names=1)
}
  • Collect data from the enrichment results
FDR_threshold <- 0.001
all_sig_enr_genesets<- c(rownames(enrFile1)[which(enrFile1[,"FDR.q.val"]<=FDR_threshold)], rownames(enrFile2)[which(enrFile2[,"FDR.q.val"]<=FDR_threshold)])
genes_sig_enr_gs <- c()
for(i in 1:length(all_sig_enr_genesets)){
  current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% all_sig_enr_genesets[i])]) 
  genes_sig_enr_gs <- union(genes_sig_enr_gs, current_geneset)
}

genes_all_gs <- unique(unlist(genesets$genesets))


  • Visualize into Venn Diagram and save into dark_matter_overlaps.png file:
A <- genes_all_gs # all genes in the gmt file
B <- genes_sig_enr_gs # all genes in the enrichment result
C <- rownames(expression) # all genes in the expression

venn_diagram_path <- file.path(getwd(),"data","dark_matter_overlaps.png")
png(venn_diagram_path)
draw.triple.venn( area1=length(A), area2=length(B), area3 = length(C),
                  n12 = length(intersect(A,B)), n13=length(intersect(A,C)),
                  n23 = length(intersect(B,C)), 
                  n123 = length(intersect(A,intersect(B,C))),
                  category = c("all genesets","all enrichment results","expression"),
                  fill = c("red","green","blue"),
                  cat.col = c("red","green","blue"))
dev.off()
Figure 9: Dark Matter Overlap Venn Diagram. The

Figure 9: Dark Matter Overlap Venn Diagram. The


6.2 Genes with no annotation

  • Collect the genes with no annotation and list out the top 10 genes. Make queries on Uniprot (Bateman et al. 2020) for the checking of dark matter top results:
# genes without annotation
genes_no_annotation <- setdiff(C, A)

#purple area: 1553 genes
length(genes_no_annotation)
## [1] 1553
ranked_gene_no_annotation <- ranks[which(ranks[,1] %in% genes_no_annotation),]

kable(ranked_gene_no_annotation[1:10,], format = 'html', caption = "Table 3.1: The Top 10 ranked genes with no annotation", col.names = c("Gene","Rank score"), escape=FALSE,align=c(rep('c',times=2)))%>%
  kable_styling(full_width = T)
Table 3.1: The Top 10 ranked genes with no annotation
Gene Rank score
45 RCN2 -8.051095
130 MROH6 -4.092775
177 CCDC86 -3.315124
302 DANCR -2.103717
322 NT5DC2 -2.020946
326 YDJC -2.008087
331 MIR137HG -1.989380
353 CTXN1 -1.886112
365 NME1-NME2 -1.839917
423 HPCAL1 -1.601521


  • 12.4598845% of genes were not found annotation, which is not a large portion and expected. Due to the normalization and mapping, not too much genes were not included in the latest version of gmt database.

  • The top gene RCN2 is shown to be associated with Reticulocalbin-2 protein which has GO annotation from GO_Central Uniprot query for RCN2. If we zoom in on the regulatory role of RCN2, it has been shown to regulate blood pressure and contribute to hypertension by affecting the endothelial NO synthases (chang_yan_yan_zheng_liu_2019?). It might not have strong correlations with trophoblast cells than the genesets included in the results.

  • The second gene MROH6 (Maestro Heat Like Repeat Family Member 6) has no annotation (Uniprot query for MROH6), which conforms to the feature of this group (no annotation). It is associated with autosomal recessive non-syndromic intellectual disability and annotated with binding in GO (mroh6_database?).


  • Visualize in heat map:
    Figure 10: Heat map of genes with no annotation

    Figure 10: Heat map of genes with no annotation


6.3 Genes with annotation but not in enrichment analysis

# genes without annotation and in enrichment analysis
genes_annotated_enr_non_sig <- setdiff(C, B)

# the purple and fuchsia areas: 1553 + 10395 genes
length(genes_annotated_enr_non_sig)
## [1] 11948
ranked_genes_annotated_enr_non_sig <- ranks[which(ranks[,1] %in% genes_annotated_enr_non_sig),]

kable(ranked_genes_annotated_enr_non_sig[1:10,], format = 'html', caption = "Table 3.2: The Top 10 ranked genes with annotation and not significant in enrichment analysis", col.names = c("Gene","Rank score"), escape=FALSE,align=c(rep('c',times=2)))%>%
  kable_styling(full_width = T)
Table 3.2: The Top 10 ranked genes with annotation and not significant in enrichment analysis
Gene Rank score
1 TGM2 -27.21846
2 CCN1 -27.00223
4 TCEA1 -22.05397
5 KRT18 -20.42065
7 KRT8 -17.27867
8 KRTAP2-3 -16.10960
9 SPARC -15.87424
10 EIF5A -15.68808
11 CLDN11 -15.66221
14 ZFP36L1 -14.43692


  • 95.860077% of genes were evaluated as non-significant in enrichment results but present in expression, which is a large portion.

  • Since the use of no_GO_iea version of gmt file as the database, the large portion of genes in this category might be due to the limitation of the annotation source. We could perform another enrichment analysis with a more election source tolerated database.

  • The top gene TGM2 has a list of annotation sources including Uniprot and Ensembl (Uniprot query for TGM2). It is related to the Protein-glutamine gamma-glutamyltransferase 2 protein involved in multiple biological process including bone development, angiogenesis, wound healing, cellular differentiation, chromatin modification and apoptosis (Bioinformatics 2022). The reason for being excluded might due to the board correlation.

  • The second gene CCN1 has extracellular matrix binding function annotated in Ensembl (Uniprot query for CCN1). The missing of CCN1 might be due to the annotation source choice.


Figure 11: Heatmap for genes with annotation but not in enrichment results

Figure 11: Heatmap for genes with annotation but not in enrichment results


6.4 Short summary for heat maps

  • In the significant genes that are not annotated to any of the pathways in the enrichment analysis, the pattern of heat map is not clear as the heat map in assignment 2 results. These genes are no

  • In the significant genes that are not annotated and returned in enrichment analysis, the pattern of heatmap is more clear and approach to the assignment 2 heatmap results.


References

Alberts, Bruce. 1970. “Cell Movements and the Shaping of the Vertebrate Body.” Molecular Biology of the Cell. 4th Edition. U.S. National Library of Medicine. https://www.ncbi.nlm.nih.gov/books/NBK26863/.
Basak, Trishita, Amit Kumar Dey, Rachana Banerjee, Sandip Paul, Tushar Kanti Maiti, and Rupasri Ain. 2020. “Sequestration of Eif4a by Angiomotin: A Novel Mechanism to Restrict Global Protein Synthesis in Trophoblast Cells.” Stem Cells 39 (2): 210–26. https://doi.org/10.1002/stem.3305.
Bateman, Alex, Maria-Jesus Martin, Sandra Orchard, Michele Magrane, Rahat Agivetova, Shadab Ahmad, Emanuele Alpi, et al. 2020. “Uniprot: The Universal Protein Knowledgebase in 2021.” Nucleic Acids Research 49 (D1). https://doi.org/10.1093/nar/gkaa1100.
Bioinformatics, UniProt ConsortiumEuropean Bioinformatics InstituteProtein Information ResourceSIB Swiss Institute of. 2022. “Protein-Glutamine Gamma-Glutamyltransferase 2.” UniProt ConsortiumEuropean Bioinformatics InstituteProtein Information ResourceSIB Swiss Institute of Bioinformatics. https://www.uniprot.org/uniprot/P21980.
“Enrichment Map Gene Sets.” 2016. GeneSets - Bader Lab @ The University of Toronto. http://baderlab.org/GeneSets.
Kucera, Mike, Ruth Isserlin, Arkady Arkhangorodsky, and Gary D. Bader. 2016. “AutoAnnotate: A Cytoscape App for Summarizing Networks with Semantic Annotations.” F1000Research 5: 1717. https://doi.org/10.12688/f1000research.9090.1.
Lee, B., L. L. Kroener, N. Xu, E. T. Wang, A. Banks, J. Williams, M. O. Goodarzi, et al. 2016. “Function and Hormonal Regulation of Gata3 in Human First Trimester Placentation.” Biology of Reproduction 95 (5): 113–13. https://doi.org/10.1095/biolreprod.116.141861.
Merico, Daniele, Ruth Isserlin, Oliver Stueker, Andrew Emili, and Gary D. Bader. 2010. “Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation.” PLoS ONE 5 (11). https://doi.org/10.1371/journal.pone.0013984.
Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “G:profiler: A Web Server for Functional Enrichment Analysis and Conversions of Gene Lists (2019 Update).” Nucleic Acids Research 47 (W1). https://doi.org/10.1093/nar/gkz369.
Shannon, Paul, Andrew Markiel, Owen Ozier, Nitin S. Baliga, Jonathan T. Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13 (11): 2498–2504. https://doi.org/10.1101/gr.1239303.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.
Temple Lang, Duncan. 2022. RCurl: General Network (HTTP/FTP/...) Client Interface for r. https://CRAN.R-project.org/package=RCurl.